Yelp Data Exploration

This page walks through how we acquired, formatted, and developed the neighborhood amenity guide for Philadelphia, starting with loading the Yelp data gathered via the Yelp Fusion API and matching each business to its census tract.

Gathering Census Tracts

import geopandas as gpd
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import altair as alt
from sklearn.cluster import KMeans
import re
from wordcloud import WordCloud


# Show all rows, all columns, and full column widths in dataframes
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', None)

np.seterr(invalid="ignore");
# phila bounds
phl_bound = gpd.read_file("data/Yelp API/phl_bound.geojson")
phl_neigh = gpd.read_file("https://raw.githubusercontent.com/opendataphilly/open-geo-data/master/Neighborhoods_Philadelphia/Neighborhoods_Philadelphia.geojson")
phl_tract = gpd.read_file("data/Yelp API/census-tracts.geojson")
phl_bound_proj = phl_bound.to_crs(2272)
phl_neigh_proj = phl_neigh.to_crs(2272)
phl_tract_proj = phl_tract.to_crs(2272)
neigh = phl_neigh_proj[["mapname", "geometry"]]
# Copying the DataFrame
tract = phl_tract_proj.copy()

# Concatenating 'neighborhood_name' and 'tract_id' into a new column 'nb_name'
tract['nb_name'] = tract['neighborhood_name'] + " " + tract['tract_id'].astype(str)
# EPSG:2272 (NAD83 / Pennsylvania South) is measured in US survey feet,
# so .area returns square feet
tract['area_ft2'] = tract['geometry'].area

# Convert to square miles (1 square mile = 5280^2 = 27,878,400 square feet)
tract['area_mi2'] = tract['area_ft2'] / 27878400
tract = tract[["nb_name", "area_mi2", "geometry"]]
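EPSG:2272 (NAD83 / Pennsylvania South) is defined in US survey feet, so the division by 27,878,400 = 5280² converts square feet to square miles. A minimal sketch of the conversion on made-up tract areas:

```python
import pandas as pd

SQFT_PER_SQMI = 5280 ** 2  # 27,878,400 square feet in one square mile

# Hypothetical tract areas in square feet (EPSG:2272 units)
tracts = pd.DataFrame({
    "nb_name": ["Cedar Park 1", "Spruce Hill 2"],
    "area_ft2": [55_756_800.0, 13_939_200.0],
})
tracts["area_mi2"] = tracts["area_ft2"] / SQFT_PER_SQMI
print(tracts["area_mi2"].tolist())  # [2.0, 0.5]
```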

Gathering All Amenities

Our amenities carry both broad and specific attributes that required classification around what a potential home-buyer or tourist would seek. Some places fall into multiple categories, such as bars acting as both entertainment and nightlife.

# amenities
restaurants = gpd.read_file("data/Yelp API/restaurants.geojson")
parks = gpd.read_file("data/Yelp API/parks.geojson")
education = gpd.read_file("data/Yelp API/education.geojson")
grocery = gpd.read_file("data/Yelp API/grocery.geojson")
entertainment = gpd.read_file("data/Yelp API/entertainment.geojson")
nightlife = gpd.read_file("data/Yelp API/nightlife.geojson")
shopping = gpd.read_file("data/Yelp API/shopping.geojson")
healthcare = gpd.read_file("data/Yelp API/healthcare.geojson")
historic = gpd.read_file("data/Yelp API/historic_landmarks.geojson")
kids = gpd.read_file("data/Yelp API/kids.geojson")
arts = gpd.read_file("data/Yelp API/perf_arts.geojson")

beauty = gpd.read_file("data/Yelp API/beauty.geojson")

Data Cleaning

Each category mentioned above includes sub-categories that more precisely depict the attributes of the place of interest. Each “alias” matters because it ensures every place is reflected by all of its features. This can result in overlap in amenity counts, but that is acceptable: it makes the neighborhood profiles more accurate, and businesses often provide more than one good or service.

Restaurants

# Filter and create a copy to avoid SettingWithCopyWarning
food_desc = restaurants.copy()

# Create a new column 'desc_1'
food_desc["desc_1"] = food_desc['alias'].str.split(',').str[0].str.strip().str.lower()
food_desc["desc_2"] = food_desc['alias'].str.split(',').str[1].str.strip().str.lower()
food_desc["desc_3"] = food_desc['alias'].str.split(',').str[2].str.strip().str.lower()
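The three-way split above can be sketched on toy alias strings (made up here); positions missing from a short alias come back as NaN:

```python
import pandas as pd

# Toy alias strings mimicking Yelp's comma-separated category tags
df = pd.DataFrame({"alias": ["Pizza, Italian", "delis", "seafood, bars, lounges"]})
parts = df["alias"].str.split(",")
for i in (1, 2, 3):
    # .str[i-1] returns NaN when the alias has fewer than i tags
    df[f"desc_{i}"] = parts.str[i - 1].str.strip().str.lower()

print(df.loc[0, "desc_1"])  # pizza
print(df.loc[2, "desc_3"])  # lounges
```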

First word in “alias”

# Group by 'desc_1' and count, then convert to DataFrame
food_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_1 count
108 pizza 320
36 chinese 198
118 restaurants 182
45 delis 161
124 seafood 101
17 breakfast_brunch 88
77 hotdogs 84
100 newamerican 81
122 sandwiches 77
58 fooddeliveryservices 75
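The frequency tables throughout this page all follow the same groupby-size pattern; `value_counts()` is an equivalent shorthand. A sketch on made-up category labels:

```python
import pandas as pd

# Toy category labels standing in for desc_1
s = pd.Series(["pizza", "chinese", "pizza", "delis", "pizza", "chinese"], name="desc_1")
counts = (s.to_frame()
            .groupby("desc_1").size()
            .reset_index(name="count")
            .sort_values("count", ascending=False))
print(counts["desc_1"].tolist())  # ['pizza', 'chinese', 'delis']

# value_counts() produces the same ranking with less typing
print(s.value_counts().idxmax())  # pizza
```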

Second word in “alias”

# Group by 'desc_2' and count, then convert to DataFrame
food_desc.groupby("desc_2").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_2 count
111 sandwiches 120
12 breakfast_brunch 86
64 hotdogs 79
17 burgers 78
112 seafood 73
28 chicken_wings 66
131 tradamerican 65
72 italian 44
98 pizza 42
27 cheesesteaks 41

Third word in “alias”

# Group by 'desc_3' and count, then convert to DataFrame
food_desc.groupby("desc_3").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_3 count
83 sandwiches 124
97 tradamerican 61
20 chicken_wings 41
50 hotdogs 39
10 breakfast_brunch 38
13 burgers 33
25 coffee 30
31 delis 26
84 seafood 26
49 hotdog 23

Replace description 1 with description 2 when description 1 is irrelevant

Words like “restaurant” and “fooddeliveryservices” don’t tell us what kind of food is served. Let’s replace the description in these cases with something more descriptive.

In these cases we search for restaurants where the first word isn’t descriptive, then replace desc_1 with desc_2 (after first backfilling desc_2 from desc_3 the same way).

# replace desc_2 with desc_3 where desc_2 is not descriptive
condition1 = food_desc['desc_2'].isin(["food trucks", "foodtrucks", "fooddeliveryservices", "delis", "bars", "grocery", "pubs", "catering", "convenience", 
                                       "sportsbars", "popuprestaurants", "restaurants", "foodstands", "comfortfood", "food_court", "lounges", "breweries", 
                                       "wine_bars", "bowling", "meats", "newamerican", "tradamerican", "american", "buffets"])

condition2 = food_desc['desc_3'].notna()

# Update desc_2 where conditions are met
food_desc.loc[condition1 & condition2, 'desc_2'] = food_desc['desc_3']


# replace desc_1
condition1 = food_desc['desc_1'].isin(["food trucks", "foodtrucks", "fooddeliveryservices", "delis", "bars", "grocery", "pubs", "catering", "convenience", 
                                       "sportsbars", "popuprestaurants", "restaurants", "foodstands", "comfortfood", "food_court", "lounges", "breweries", 
                                       "wine_bars", "bowling", "meats", "newamerican", "tradamerican", "american", "buffets"])

condition2 = food_desc['desc_2'].notna()

# Update desc_1 where conditions are met
food_desc.loc[condition1 & condition2, 'desc_1'] = food_desc['desc_2']
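The generic-term backfill above is a boolean-mask `.loc` assignment; pandas aligns the right-hand column by index, so only the masked rows change. A minimal sketch with made-up rows:

```python
import pandas as pd

df = pd.DataFrame({
    "desc_1": ["restaurants", "pizza", "delis"],
    "desc_2": ["chinese", None, "sandwiches"],
})
generic = ["restaurants", "delis"]
mask = df["desc_1"].isin(generic) & df["desc_2"].notna()

# Overwrite the generic label with the more specific second tag
df.loc[mask, "desc_1"] = df["desc_2"]
print(df["desc_1"].tolist())  # ['chinese', 'pizza', 'sandwiches']
```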

food_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head()
desc_1 count
111 pizza 339
37 chinese 203
124 restaurants 182
128 sandwiches 150
130 seafood 126

Unfortunately this didn’t replace “restaurants”. Restaurants with only “restaurants” as the first word in their alias usually have no other descriptive words, which shows that “restaurants” is the default label. This calls for another approach.

food_desc[food_desc["desc_1"] == "restaurants"].head(5)
name rating address alias title geometry desc_1 desc_2 desc_3
8 Smitty's Restaurant 0.0 5405 Chester Ave restaurants Restaurants POINT (2676904.285 231029.670) restaurants NaN NaN
10 Kan Yuen Ping 0.0 5428 Chester Ave restaurants Restaurants POINT (2676791.167 230669.291) restaurants NaN NaN
11 Silver Place 0.0 1500 S 58th St restaurants Restaurants POINT (2673976.157 231063.740) restaurants NaN NaN
14 Yuan East Restaurant 0.0 1832 S 58th St restaurants Restaurants POINT (2675345.361 229583.321) restaurants NaN NaN
17 Wei Hua 0.0 5140 Chester Ave restaurants Restaurants POINT (2677863.415 231724.048) restaurants NaN NaN

By inspecting the names of restaurants lacking descriptions, we can parse out some common categories.

restaurant_cloud = food_desc[food_desc["desc_1"] == "restaurants"]

# Concatenate all text in the column
text = ' '.join(restaurant_cloud['name'].dropna())

# Create the word cloud
wordcloud = WordCloud(width=800, height=400, background_color ='white').generate(text)

# Display the word cloud using matplotlib
plt.figure(figsize=(10, 5))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
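If the wordcloud package isn’t available, a plain token count surfaces the same recurring words. This sketch swaps WordCloud for `collections.Counter`, on a few names from the table above plus one invented one:

```python
from collections import Counter

# Sample names of restaurants lacking a descriptive alias
# ("New China Restaurant" is invented for the example)
names = ["Smitty's Restaurant", "Yuan East Restaurant", "Wei Hua",
         "China Wok", "New China Restaurant"]
tokens = Counter(word.lower() for name in names for word in name.split())
print(tokens.most_common(2))  # [('restaurant', 3), ('china', 2)]
```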

# Define the mapping from name keywords to cuisine categories
keyword_mapping = {
    'yuen': 'chinese',
    'ping': 'chinese',
    'zhong': 'chinese',
    'china': 'chinese',
    'canton': 'chinese',
    'wok': 'chinese',
    'lee': 'chinese',
    'zhang': 'chinese',
    'chung': 'chinese',
    'dragon': 'chinese',
    'meng': 'chinese',
    'xin': 'chinese',
    'hua': 'chinese',
    'yi': 'chinese',
    'xi': 'chinese',
    'lam': 'chinese',
    'wei': 'chinese',
    'salad': 'salad',
    'pizza': 'pizza',
    'pizzeria': 'pizza',
    'grill': 'grill',
    'steak': 'grill',
    'sushi': 'japanese',
    'chicken': 'chicken',
    'bbq': 'bbq',
    'hoagie': 'sandwich',
    'steaks': 'sandwich',
    'coffee': 'cafe', 
    'cafe': 'cafe',
    'diner': 'american',
    'luncheonette': 'american',
    'wings': 'wings',
    'deli': 'deli',
    'mexican': 'latin',
    'guadalupana': 'latin',
    'guatemalteco': 'latin',
    'fish': 'seafood',
    'casa': 'latin'}

# Function to find a keyword in the name and return the corresponding category
def map_keyword_to_desc(name, mapping):
    for keyword, desc in mapping.items():
        if re.search(re.escape(keyword), name, re.IGNORECASE):
            return desc
    return None

# Define the list of generic, non-descriptive values
specific_values = ["food trucks", "fooddeliveryservices", "bars", "grocery", "restaurants"]

# Define the combined condition: desc_2 is NA or desc_1 is one of the generic values
condition_combined = food_desc['desc_2'].isna() | food_desc['desc_1'].isin(specific_values)

# Update desc_1 from the business name where the combined condition is met
food_desc.loc[condition_combined, 'desc_1'] = food_desc.loc[condition_combined, 'name'].apply(lambda x: map_keyword_to_desc(x, keyword_mapping))
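Applied to sample names, the mapper returns the category of the first keyword that matches (dict order matters, since the loop stops at the first hit) and None when nothing matches. Re-declared here with a trimmed mapping so the sketch is self-contained:

```python
import re

# Trimmed version of the full keyword mapping, for illustration only
keyword_mapping = {"wok": "chinese", "pizza": "pizza", "casa": "latin"}

def map_keyword_to_desc(name, mapping):
    # Return the category of the first keyword found in the name
    for keyword, desc in mapping.items():
        if re.search(re.escape(keyword), name, re.IGNORECASE):
            return desc
    return None

print(map_keyword_to_desc("China Wok", keyword_mapping))    # chinese
print(map_keyword_to_desc("Casa Mexico", keyword_mapping))  # latin
print(map_keyword_to_desc("Silver Place", keyword_mapping)) # None
```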

Reformat into larger categories

# Define the mapping dictionary
desc_mapping = {
    "burgers": "sandwiches",
    "hotdogs": "sandwiches",
    "chicken_wings": "chicken",
    "tradamerican": "american",
    "newamerican": "american",
    "wings": "chicken",
    "hotdog": "diner/grill",
    "gastropubs": "diner/grill",
    "diners": "diner/grill",
    "steak": "diner/grill",
    "grills": "diner/grill",
    "grill": "diner/grill",
    "coffee": "bakeries/cafes",
    'cafe': "bakeries/cafes",
    "cafes": "bakeries/cafes",
    'bagels': "bakeries/cafes",
    'wraps': "bakeries/cafes",
    'soups': "bakeries/cafes",
    'soup': "bakeries/cafes",
    "cupcakes": "bakeries/cafes",
    "creperies": "bakeries/cafes",
    "bakeries": "bakeries/cafes",
    "juicebars": "bakeries/cafes",
    'donuts': "bakeries/cafes",
    'icecream': "bakeries/cafes",
    "ice cream": 'bakeries/cafes',
    "desserts": "bakeries/cafes",
    "waffles": "bakeries/cafes",
    "cakeshop": "bakeries/cafes",
    'customcakes': 'bakeries/cafes',
    "cheesesteaks": "sandwiches/delis",
    "sandwich": "sandwiches/delis",
    "sandwiches": "sandwiches/delis",
    "delis": "sandwiches/delis",
    "deli": "sandwiches/delis",
    "chickenshop": "chicken",
    "sushi": "japanese",
    "ramen": "japanese",
    'hotpot': 'chinese',
    "szechuan": "chinese",
    "cantonese": "chinese",
    "dimsum": "chinese",
    "shanghainese": "chinese",
    "pretzels": "bakeries/cafes",
    "mexican": "latin",
    "honduran": "latin",
    "nicaraguan": "latin",
    "spanish": "latin",
    "guatemalan": "latin",
    "tacos": "latin",
    "brazilian": "latin",
    "portuguese": "latin",
    "dominican": "caribbean",
    "haitian": "caribbean",
    "seafoodmarket": "seafood",
    "seafoodmarkets": "seafood",
    "greek": "mediterranean",
    "mideastern": "halal"
}

# Replace desc_1 values using the mapping dictionary
food_desc['desc_1'] = food_desc['desc_1'].replace(desc_mapping)
food_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head()
desc_1 count
70 pizza 326
80 sandwiches/delis 241
8 bakeries/cafes 216
79 sandwiches 172
25 chinese 159
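`Series.replace` with a dict only touches values that appear as keys, which is what makes it safe for this kind of consolidation. A sketch using a few of the mappings above:

```python
import pandas as pd

desc_mapping = {"burgers": "sandwiches", "coffee": "bakeries/cafes", "sushi": "japanese"}
s = pd.Series(["burgers", "pizza", "coffee", "sushi"])

# "pizza" has no key in the mapping, so it passes through unchanged
print(s.replace(desc_mapping).tolist())  # ['sandwiches', 'pizza', 'bakeries/cafes', 'japanese']
```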

Identify fast food/chain restaurants

Override desc_1 with “fastfood” for major chain restaurants.

food_desc.groupby("name").size().sort_values(ascending = False).head(10)
name
Wawa                         28
McDonald's                   25
Popeyes Louisiana Kitchen    19
Wendy's                      17
Philly Pretzel Factory       13
KFC                          13
Subway                       10
Wingstop                      9
Taco Bell                     9
Crown Fried Chicken           8
dtype: int64
fast_food = [
    "McDonald's",
    "Wawa", 
    "Wingstop",
    "MrBeast Burger",
    "Chick-fil-A",
    "Chipotle Mexican Grill",
    "Checkers",
    "Wendy's",
    "Subway",
    "Popeyes Louisiana Kitchen",
    "Taco Bell",
    "KFC",
    "Burger King",
]
food_desc.loc[food_desc["name"].isin(fast_food), "desc_1"] = "fastfood"

Replace descriptions with frequency < 6 with the second term in the alias

desc_counts = food_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False)

# Filter for 'desc_1' values where count is less than 6
desc_less_than_six = desc_counts[desc_counts['count'] < 6]['desc_1']

# Convert the filtered results to a list
desc_list = desc_less_than_six.tolist()

# replace desc_1 with desc_2 where desc_1 is in the list of descriptions with frequency < 6
food_desc.loc[food_desc['desc_1'].isin(desc_list), 'desc_1'] = food_desc['desc_2']
food_cats = food_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False)
top_food_cats = list(food_cats[food_cats["count"] >= 12]["desc_1"])
food_desc = food_desc[food_desc["desc_1"].isin(top_food_cats)]
food_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False)
food_desc.head(5)
desc_1 count
18 pizza 327
21 sandwiches/delis 220
2 bakeries/cafes 214
10 fastfood 162
7 chinese 159
22 seafood 92
9 diner/grill 78
20 sandwiches 77
4 breakfast_brunch 76
6 chicken 67
16 latin 63
14 japanese 54
13 italian 42
0 american 38
19 salad 37
23 soulfood 37
11 halal 29
5 caribbean 25
17 mediterranean 23
3 bbq 17
8 cocktailbars 15
12 indpak 13
24 venues 13
25 vietnamese 13
1 asianfusion 12
15 korean 12
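This prune-rare-categories step, reused for each amenity type below, boils down to: count labels, keep those above a threshold, filter with `isin`. A sketch with made-up rows and a threshold of 2:

```python
import pandas as pd

df = pd.DataFrame({"desc_1": ["pizza", "pizza", "pizza", "thai", "delis", "delis"]})
counts = df.groupby("desc_1").size().reset_index(name="count")

# Keep only categories that occur at least twice
keep = list(counts[counts["count"] >= 2]["desc_1"])
df = df[df["desc_1"].isin(keep)]
print(sorted(df["desc_1"].unique()))  # ['delis', 'pizza']
```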

Parks

# Filter and create a copy to avoid SettingWithCopyWarning
parks_desc = parks.copy()

# Create a new column 'desc_1'
parks_desc["desc_1"] = parks_desc['alias'].str.split(',').str[0].str.strip().str.lower()
parks_desc["desc_2"] = parks_desc['alias'].str.split(',').str[1].str.strip().str.lower()
parks_desc["desc_3"] = parks_desc['alias'].str.split(',').str[2].str.strip().str.lower()

First word in ‘alias’

# Group by 'desc_1' and count, then convert to DataFrame
parks_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(5)
desc_1 count
19 parks 150
15 landmarks 36
21 playgrounds 25
7 dog_parks 23
13 hiking 17
24 recreation 9
12 gardens 7
25 skate_parks 6
28 tennis 4
11 football 4
3 basketballcourts 2
6 discgolf 2
14 kids_activities 2
17 museums 2
18 nonprofit 2
23 rafting 2
20 pets 1
26 stadiumsarenas 1
22 publicart 1
27 swimmingpools 1
0 amateursportsteams 1
16 localflavor 1
1 bars 1
10 fishing 1
9 farms 1
8 farmersmarket 1
5 climbing 1
4 bikes 1
2 baseballfields 1
29 theater 1

Filter for the most frequent categories

park_desc_1 = parks_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(9)
top_park_desc = list(park_desc_1["desc_1"])
# remove historical landmarks
top_park_desc.remove('landmarks')

# filter for obs in the list of top descriptions
parks_desc = parks_desc[parks_desc["desc_1"].isin(top_park_desc)]
parks_desc.head(10)
name rating address alias title geometry desc_1 desc_2 desc_3
0 Cedar Park 4.5 50th St & Baltimore Ave parks Parks POINT (2677006.643 234133.514) parks NaN NaN
1 Chester Avenue Dog Park 3.5 801 S 48th St dog_parks Dog Parks POINT (2678024.964 234174.704) dog_parks NaN NaN
2 Malcolm X Park 4.0 51ST And Pine St parks Parks POINT (2676187.013 235721.062) parks NaN NaN
3 Barkan Park 3.0 4936 Spruce St parks, playgrounds Parks, Playgrounds POINT (2677109.530 236278.545) parks playgrounds NaN
4 Clark Park Dog Bowl 4.5 43rd & Chester dog_parks Dog Parks POINT (2680815.785 234312.734) dog_parks NaN NaN
5 Clark Park 4.5 43RD And Baltimore parks Parks POINT (2680756.007 234729.865) parks NaN NaN
6 Greys Ferry Cresent Skatepark 5.0 3600 Grays Ferry Ave skate_parks Skate Parks POINT (2683004.637 231400.562) skate_parks NaN NaN
7 Grays Ferry Crescent 4.0 parks Parks POINT (2682813.782 232182.835) parks NaN NaN
8 Lanier Dog Park 4.0 2911 Tasker St dog_parks Dog Parks POINT (2685221.027 229056.524) dog_parks NaN NaN
9 Saunder's Green 3.5 300-50 Saunders Ave parks Parks POINT (2683400.920 238616.685) parks NaN NaN
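The parks filter builds a top-N list, drops one category, then filters. Note that `list.remove` raises ValueError if the item is absent, so this assumes 'landmarks' always makes the top list. A sketch with made-up rows:

```python
import pandas as pd

top = ["parks", "landmarks", "playgrounds", "dog_parks"]
top.remove("landmarks")  # landmarks belong to the historic amenity group instead

# Hypothetical park rows for illustration
df = pd.DataFrame({"name": ["Cedar Park", "Liberty Bell", "Barkan Park"],
                   "desc_1": ["parks", "landmarks", "playgrounds"]})
df = df[df["desc_1"].isin(top)]
print(df["name"].tolist())  # ['Cedar Park', 'Barkan Park']
```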

Education

# Filter and create a copy to avoid SettingWithCopyWarning
edu_desc = education.copy()

# Create a new column 'desc_1'
edu_desc["desc_1"] = edu_desc['alias'].str.split(',').str[0].str.strip().str.lower()
edu_desc["desc_2"] = edu_desc['alias'].str.split(',').str[1].str.strip().str.lower()
edu_desc["desc_3"] = edu_desc['alias'].str.split(',').str[2].str.strip().str.lower()

First word in alias

# Group by 'desc_1' and count, then convert to DataFrame
edu_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(5)
desc_1 count
14 childcare 91
29 elementaryschools 86
66 preschools 62
17 collegeuniv 55
43 highschools 33
26 driving_schools 31
28 educationservices 25
75 specialtyschools 23
81 theater 17
6 artschools 15
19 cosmetology_schools 14
83 tutoring 13
3 artclasses 13
27 education 12
22 dance_schools 10
86 vocation 9
23 dancestudio 8
68 privatetutors 7
49 montessori 7
20 cprclasses 7

Filter for frequency ≥ 6

edu_desc_1 = edu_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(18)
top_edu_desc = list(edu_desc_1["desc_1"])
# remove theater (covered by the entertainment category)
top_edu_desc.remove('theater')

# filter for obs in the list of top descriptions
edu_desc = edu_desc[edu_desc["desc_1"].isin(top_edu_desc)]
edu_desc.head(10)
name rating address alias title geometry desc_1 desc_2 desc_3
0 BellyWise 5.0 NaN midwives, specialtyschools Midwives, Specialty Schools POINT (2676169.939 231988.984) midwives specialtyschools NaN
1 Limitless CPR 5.0 NaN cprclasses CPR Classes POINT (2677042.522 231394.414) cprclasses NaN NaN
2 Michele Judge 5.0 4919 Pentridge St jewelryrepair, artclasses, jewelry Jewelry Repair, Art Classes, Jewelry POINT (2677304.635 233898.273) jewelryrepair artclasses jewelry
3 Kipp West Philadelphia Prep Charter 1.0 5900 Baltimore Ave specialed Special Education POINT (2671666.162 232636.864) specialed NaN NaN
4 Beulah Baptist Christian Day School 2.0 5001 Spruce St elementaryschools, preschools Elementary Schools, Preschools POINT (2677065.563 236331.076) elementaryschools preschools NaN
5 The Academy of Industrial Arts 5.0 6328 Paschall Ave vocation, electricians Vocational & Technical School, Electricians POINT (2674851.743 225884.662) vocation electricians NaN
6 TYLII 3.0 5124 Walnut St preschools, childcare Preschools, Child Care & Day Care POINT (2676415.568 237130.841) preschools childcare NaN
7 University of the Sciences 3.0 600 S 43rd St collegeuniv Colleges & Universities POINT (2681062.408 233555.127) collegeuniv NaN NaN
8 All Around This World 5.0 4336 Pine St educationservices Educational Services POINT (2680347.395 235381.376) educationservices NaN NaN
9 Jubilee School 3.0 4211 Chester Ave highschools Middle Schools & High Schools POINT (2681043.809 234475.721) highschools NaN NaN

Grocery

# Filter and create a copy to avoid SettingWithCopyWarning
grocery_desc = grocery.copy()

# Create a new column 'desc_1'
grocery_desc["desc_1"] = grocery_desc['alias'].str.split(',').str[0].str.strip().str.lower()
grocery_desc["desc_2"] = grocery_desc['alias'].str.split(',').str[1].str.strip().str.lower()
grocery_desc["desc_3"] = grocery_desc['alias'].str.split(',').str[2].str.strip().str.lower()
grocery_desc.head()
name rating address alias title geometry desc_1 desc_2 desc_3
0 Kim A Grocery & Deli 1.0 1840 S 58th St grocery Grocery POINT (2675396.135 229529.164) grocery NaN NaN
1 S and J Seafood 4.0 713 S 52nd St seafoodmarkets, grocery Seafood Markets, Grocery POINT (2675793.250 234207.384) seafoodmarkets grocery NaN
2 Mariposa Food Co-op 4.0 4824 Baltimore Ave grocery, healthmarkets Grocery, Health Markets POINT (2677563.153 234052.277) grocery healthmarkets NaN
3 Jennifer Grocery 2.5 4824 Chester Ave grocery, convenience Grocery, Convenience Stores POINT (2678754.671 232771.760) grocery convenience NaN
4 Fu-Wah Mini Market 4.5 810 S 47th St convenience, sandwiches, grocery Convenience Stores, Sandwiches, Grocery POINT (2678571.989 234207.924) convenience sandwiches grocery

Nightlife

# Filter and create a copy to avoid SettingWithCopyWarning
night_desc = nightlife.copy()

# Create a new column 'desc_1'
night_desc["desc_1"] = night_desc['alias'].str.split(',').str[0].str.strip().str.lower()
night_desc["desc_2"] = night_desc['alias'].str.split(',').str[1].str.strip().str.lower()
night_desc["desc_3"] = night_desc['alias'].str.split(',').str[2].str.strip().str.lower()
night_desc.head()
name rating address alias title geometry desc_1 desc_2 desc_3
0 Pentridge Station Beer Garden 4.5 5116 Pentridge St beergardens Beer Gardens POINT (2676581.781 233167.788) beergardens NaN NaN
1 720 Bistro 4.0 720 S.52nd st seafood, breakfast_brunch, cocktailbars Seafood, Breakfast & Brunch, Cocktail Bars POINT (2675620.054 234102.746) seafood breakfast_brunch cocktailbars
2 Bayou 4.5 5025 Baltimore Ave lounges Lounges POINT (2676525.799 234061.840) lounges NaN NaN
3 Booker's Restaurant and Bar 3.5 5021 Baltimore Ave bars, breakfast_brunch, tradamerican Bars, Breakfast & Brunch, American POINT (2676559.222 234070.085) bars breakfast_brunch tradamerican
4 The Wine Garden 4.5 5019 Baltimore Ave wine_bars Wine Bars POINT (2676563.551 234060.148) wine_bars NaN NaN

Alias

# Group by 'desc_1' and count, then convert to DataFrame
night_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_1 count
6 bars 172
86 sportsbars 78
75 pubs 71
34 divebars 53
61 lounges 49
100 tradamerican 49
67 newamerican 47
23 cocktailbars 39
47 hookah_bars 28
66 musicvenues 28
# Group by 'desc_2' and count, then convert to DataFrame
night_desc.groupby("desc_2").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_2 count
7 bars 93
103 tradamerican 61
93 sportsbars 48
25 cocktailbars 43
69 newamerican 39
62 lounges 28
86 seafood 28
68 musicvenues 25
80 pubs 24
11 beerbar 19
# Group by 'desc_3' and count, then convert to DataFrame
night_desc.groupby("desc_3").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_3 count
17 cocktailbars 51
70 tradamerican 39
3 bars 24
8 breakfast_brunch 21
6 beerbar 20
72 venues 17
10 burgers 17
56 sandwiches 15
48 musicvenues 12
62 sportsbars 12

Replace desc_1 with more relevant alias terms

# replace desc_2 with desc_3 where desc_2 is a cuisine term rather than a nightlife term
condition1 = night_desc['desc_2'].isin(["newamerican", "seafood", "breakfast_brunch", "tradamerican", "burgers", "sandwiches", "pizza", "mexican", "italian",
                                       "french", "ramen", "tapasmallplates", "mideastern", "comfortfood", "steak", "mediterranean", "caribbean",
                                       "bbq", "peruvian", "japanese", "ethiopian", "tex-mex", "southern", "juicebars", "cajun", "soulfood", "portugeuse", "sushi",
                                       "vegan", "cafes", "diners", "chicken-wings", "chinese", "salad", "cheesesteaks", "african", "german", "desserts", "indpak",
                                       "thai", "tacos", "falafel", "vietnamese", "waffles", "delis", "coffee", "korean", "hotdog", "moroccan", "newmexican", "cuban",
                                       "argentine", "gluten_free", "russian", "bakeries", "portuguese", "spanish", "latin", "catering", "gastropubs", 
                                       "breweries", "gaybars", "whiskeybars", "tikibars", "tobaccoshops", "beer_and_wine", "venues",
                                       "poolhalls", "jazzandblues", "airportlounges", "cigarbars", "speakeasies", "theater", "nightlife"])

condition2 = night_desc['desc_3'].notna()

# Update desc_2 where conditions are met
night_desc.loc[condition1 & condition2, 'desc_2'] = night_desc['desc_3']


# replace desc_1
condition1 = night_desc['desc_1'].isin(["newamerican", "seafood", "breakfast_brunch", "tradamerican", "burgers", "sandwiches", "pizza", "mexican", "italian",
                                       "french", "ramen", "tapasmallplates", "mideastern", "comfortfood", "steak", "mediterranean", "caribbean",
                                       "bbq", "peruvian", "japanese", "ethiopian", "tex-mex", "southern", "juicebars", "cajun", "soulfood", "portugeuse", "sushi",
                                       "vegan", "cafes", "diners", "chicken-wings", "chinese", "salad", "cheesesteaks", "african", "german", "desserts", "indpak",
                                       "thai", "tacos", "falafel", "vietnamese", "waffles", "delis", "coffee", "korean", "hotdog", "moroccan", "newmexican", "cuban",
                                       "argentine", "gluten_free", "russian", "bakeries", "portuguese", "spanish", "latin", "catering", "gastropubs", 
                                       "breweries", "gaybars", "whiskeybars", "tikibars", "tobaccoshops", "beer_and_wine", "venues",
                                       "poolhalls", "jazzandblues", "airportlounges", "cigarbars", "speakeasies", "theater", "nightlife"])

condition2 = night_desc['desc_2'].notna()

# Update desc_1 where conditions are met
night_desc.loc[condition1 & condition2, 'desc_1'] = night_desc['desc_2']

night_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(15)
desc_1 count
4 bars 271
51 sportsbars 114
12 cocktailbars 105
45 pubs 86
39 lounges 64
20 divebars 60
6 beerbar 44
61 wine_bars 42
40 musicvenues 42
31 hookah_bars 36
17 danceclubs 32
34 irish_pubs 14
37 karaoke 13
7 beergardens 12
0 airportlounges 9

Filter for the most frequent categories

night_desc_1 = night_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(9)
top_night_desc = list(night_desc_1["desc_1"])

# filter for obs in the list of top descriptions
night_desc = night_desc[night_desc["desc_1"].isin(top_night_desc)]
night_desc.head(10)
name rating address alias title geometry desc_1 desc_2 desc_3
1 720 Bistro 4.0 720 S.52nd st seafood, breakfast_brunch, cocktailbars Seafood, Breakfast & Brunch, Cocktail Bars POINT (2675620.054 234102.746) cocktailbars cocktailbars cocktailbars
2 Bayou 4.5 5025 Baltimore Ave lounges Lounges POINT (2676525.799 234061.840) lounges NaN NaN
3 Booker's Restaurant and Bar 3.5 5021 Baltimore Ave bars, breakfast_brunch, tradamerican Bars, Breakfast & Brunch, American POINT (2676559.222 234070.085) bars tradamerican tradamerican
4 The Wine Garden 4.5 5019 Baltimore Ave wine_bars Wine Bars POINT (2676563.551 234060.148) wine_bars NaN NaN
5 Carbon Copy 4.5 701 S 50th St breweries, pizza, bars Breweries, Pizza, Bars POINT (2676997.957 233927.765) bars bars bars
6 Dock Street Cannery and Tasting Room 4.0 705 S 50th St beerbar, lounges Beer Bar, Lounges POINT (2677046.168 233912.738) beerbar lounges NaN
7 Trendsetters Bar & Lounge 3.5 5301 Woodland Ave burgers, lounges, wraps Burgers, Lounges, Wraps POINT (2678439.957 230236.502) lounges lounges wraps
8 The Barn 3.5 4901 Catharine St poolhalls, sportsbars, divebars Pool Halls, Sports Bars, Dive Bars POINT (2677365.738 234212.623) sportsbars sportsbars divebars
9 Eris Temple 4.0 602 S 52nd St musicvenues Music Venues POINT (2675702.055 234852.608) musicvenues NaN NaN
10 Dahlak 3.5 4708 Baltimore Ave ethiopian, divebars Ethiopian, Dive Bars POINT (2678487.076 234234.642) divebars divebars NaN

Entertainment

# Filter and create a copy to avoid SettingWithCopyWarning
entertain_desc = entertainment.copy()

# Create a new column 'desc_1'
entertain_desc["desc_1"] = entertain_desc['alias'].str.split(',').str[0].str.strip().str.lower()
entertain_desc["desc_2"] = entertain_desc['alias'].str.split(',').str[1].str.strip().str.lower()
entertain_desc["desc_3"] = entertain_desc['alias'].str.split(',').str[2].str.strip().str.lower()
entertain_desc.head()
name rating address alias title geometry desc_1 desc_2 desc_3
0 Noam Osband 5.0 NaN musicians Musicians POINT (2676083.283 232370.886) musicians NaN NaN
1 Ceramic Concept 5.0 5015 Baltimore Ave galleries Art Galleries POINT (2676620.711 234054.592) galleries NaN NaN
2 Eris Temple 4.0 602 S 52nd St musicvenues Music Venues POINT (2675702.055 234852.608) musicvenues NaN NaN
3 Baltimore Avenue Dollar Stroll 5.0 4800 Baltimore Ave unofficialyelpevents, festivals Unofficial Yelp Events, Festivals POINT (2677842.004 234135.971) unofficialyelpevents festivals NaN
4 Curio Theatre Company 4.5 815 S 48th St theater Performing Arts POINT (2678192.432 234160.959) theater NaN NaN

Alias

# Group by 'desc_1' and count, then convert to DataFrame
entertain_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_1 count
130 theater 127
53 galleries 96
50 festivals 69
86 museums 42
42 djs 28
89 musicvenues 28
105 psychics 23
85 movietheaters 20
117 social_clubs 19
47 eventplanning 16
entertain_desc.groupby("desc_2").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_2 count
45 galleries 32
80 musicvenues 25
116 theater 24
119 venues 19
35 dancestudio 14
39 eventplanning 11
34 danceclubs 10
75 movietheaters 10
4 arcades 9
14 bars 8
entertain_desc.groupby("desc_3").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_3 count
59 musicvenues 13
92 venues 9
25 dance_schools 6
27 dancestudio 5
58 musicians 5
34 eventplanning 5
49 kids_activities 5
2 arcades 4
3 artclasses 4
89 theater 4

Replace desc_1 with more relevant alias terms

# replace desc_2 with desc_3 where desc_2 is too generic
condition1 = entertain_desc['desc_2'].isin(["arts", "musicians", "kids_activities", "venues", "bars", "nonprofit", "lounges"])

condition2 = entertain_desc['desc_3'].notna()

# Update desc_2 where conditions are met
entertain_desc.loc[condition1 & condition2, 'desc_2'] = entertain_desc['desc_3']


# replace desc_1
condition1 = entertain_desc['desc_1'].isin(["arts", "musicians", "kids_activities", "venues", "bars", "nonprofit", "lounges"])

condition2 = entertain_desc['desc_2'].notna()

# Update desc_1 where conditions are met
entertain_desc.loc[condition1 & condition2, 'desc_1'] = entertain_desc['desc_2']

entertain_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(40)
desc_1 count
131 theater 130
55 galleries 98
52 festivals 70
87 museums 42
90 musicvenues 34
43 djs 29
118 social_clubs 24
106 psychics 23
86 movietheaters 21
48 eventplanning 17
39 danceclubs 16
35 countryclubs 11
57 gardens 10
7 artmuseums 10
104 psychic_astrology 9
40 dancestudio 9
95 paintandsip 9
125 stadiumsarenas 9
8 arts 8
17 bowling 8
5 arcades 8
100 photoboothrentals 8
124 sportsteams 7
6 artclasses 7
37 culturalcenter 7
32 comedyclubs 6
105 psychicmediums 6
91 newamerican 6
82 magicians 5
71 jazzandblues 5
27 catering 5
26 casinos 5
15 bingo 5
54 framing 4
122 specialtyschools 4
123 sportsbars 4
0 4
98 partyequipmentrentals 4
28 clowns 4
9 artsandcrafts 3

Filter for categories where frequency ≥ 5

entertain_desc_1 = entertain_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False)
entertain_desc_1 = entertain_desc_1[entertain_desc_1['count'] >= 5]

top_entertain_desc = list(entertain_desc_1["desc_1"])
# remove psychics
top_entertain_desc.remove('psychics')

# filter for obs in the list of top descriptions
entertain_desc = entertain_desc[entertain_desc["desc_1"].isin(top_entertain_desc)]
entertain_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False)
entertain_desc.head(5)
desc_1 count
31 theater 130
16 galleries 98
15 festivals 70
21 museums 42
22 musicvenues 34
13 djs 29
28 social_clubs 24
20 movietheaters 21
14 eventplanning 17
11 danceclubs 16
9 countryclubs 11
2 artmuseums 10
17 gardens 10
12 dancestudio 9
24 paintandsip 9
30 stadiumsarenas 9
26 psychic_astrology 9
25 photoboothrentals 8
5 bowling 8
3 arts 8
0 arcades 8
10 culturalcenter 7
1 artclasses 7
29 sportsteams 7
23 newamerican 6
27 psychicmediums 6
8 comedyclubs 6
19 magicians 5
7 catering 5
6 casinos 5
18 jazzandblues 5
4 bingo 5
# Define the mapping dictionary
desc_mapping = {
    "djs": "nightclubs",
    "danceclubs": "nightclubs",
    "artmuseums": "museums",
    "sportsteams": "sports",
    "stadiumsarenas": "sports",
    "paintandsip": "specialtyclasses",
    "dancestudio": "specialtyclasses",
    "bowling": "games/arcade",
    "bingo": "games/arcade",
    "arcades": "games/arcade",
    "artclasses": "specialtyclasses"
}

# Replace desc_1 values using the mapping dictionary
entertain_desc.loc[:, 'desc_1'] = entertain_desc['desc_1'].replace(desc_mapping)
entertain_cat = entertain_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False)
entertain_cat_top = list(entertain_cat[entertain_cat["count"] >= 11]["desc_1"])
# filter for desc_1
entertain_desc = entertain_desc[entertain_desc["desc_1"].isin(entertain_cat_top)]
entertain_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False)
entertain_desc.head(5)
desc_1 count
12 theater 130
3 galleries 98
2 festivals 70
6 museums 52
8 nightclubs 45
7 musicvenues 34
10 specialtyclasses 25
9 social_clubs 24
4 games/arcade 21
5 movietheaters 21
1 eventplanning 17
11 sports 16
0 countryclubs 11

Shopping

# Filter and create a copy to avoid SettingWithCopyWarning
shop_desc = shopping.copy()

# Create a new column 'desc_1'
shop_desc["desc_1"] = shop_desc['alias'].str.split(',').str[0].str.strip().str.lower()
shop_desc["desc_2"] = shop_desc['alias'].str.split(',').str[1].str.strip().str.lower()
shop_desc["desc_3"] = shop_desc['alias'].str.split(',').str[2].str.strip().str.lower()
shop_desc.head()
name rating address alias title geometry desc_1 desc_2 desc_3
0 Rite Aid 2.0 5214-30 Baltimore Ave drugstores, convenience Drugstores, Convenience Stores POINT (2675264.630 233732.096) drugstores convenience NaN
1 Ajah Creative Sweets And Treats & More 5.0 NaN desserts, cupcakes, giftshops Desserts, Cupcakes, Gift Shops POINT (2677939.712 232732.831) desserts cupcakes giftshops
2 Ceramic Concept 5.0 5015 Baltimore Ave galleries Art Galleries POINT (2676620.711 234054.592) galleries NaN NaN
3 VIX Emporium 4.5 5009 Baltimore Ave artsandcrafts, childcloth, jewelry Arts & Crafts, Children's Clothing, Jewelry POINT (2676676.410 234091.659) artsandcrafts childcloth jewelry
4 Michele Judge 5.0 4919 Pentridge St jewelryrepair, artclasses, jewelry Jewelry Repair, Art Classes, Jewelry POINT (2677304.635 233898.273) jewelryrepair artclasses jewelry

Alias

# Group by 'desc_1' and count, then convert to DataFrame
shop_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_1 count
32 drugstores 234
142 womenscloth 114
69 jewelry 90
31 discountstore 87
49 galleries 84
48 furniture 74
113 shoes 68
29 deptstores 56
86 menscloth 50
11 bookstores 41
# Group by 'desc_2' and count, then convert to DataFrame
shop_desc.groupby("desc_2").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_2 count
24 convenience 95
0 accessories 67
139 womenscloth 66
81 menscloth 51
48 furniture 38
65 jewelry 35
100 pharmacy 34
35 discountstore 32
49 galleries 29
61 homedecor 27
# Group by 'desc_3' and count, then convert to DataFrame
shop_desc.groupby("desc_3").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_3 count
21 convenience 55
0 accessories 30
115 womenscloth 28
54 homedecor 26
79 pharmacy 23
95 sportswear 18
92 shoes 18
56 jewelry 18
15 childcloth 16
108 vintage 15

Filter for most common labels

shop_desc_1 = shop_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(20)
top_shop_desc = list(shop_desc_1["desc_1"])
top_shop_desc
['drugstores',
 'womenscloth',
 'jewelry',
 'discountstore',
 'galleries',
 'furniture',
 'shoes',
 'deptstores',
 'menscloth',
 'bookstores',
 'accessories',
 'thrift_stores',
 'antiques',
 'grocery',
 'giftshops',
 'homedecor',
 'selfstorage',
 'childcloth',
 'sportswear',
 'vintage']
# remove drugstores and grocery
top_shop_desc.remove('drugstores')
top_shop_desc.remove('grocery')

# filter for obs in the list of top descriptions
shop_desc = shop_desc[shop_desc["desc_1"].isin(top_shop_desc)]
# Define the mapping dictionary
desc_mapping = {
    "mattresses": "furniture",
    "kitchenandbath": "furniture",
    "rugs": "homedecor",
    "hats": "accessories",
    "watch_repair": "accessories"
}

# Replace desc_1 values using the mapping dictionary
shop_desc.loc[:, 'desc_1'] = shop_desc['desc_1'].replace(desc_mapping)
shop_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(20)
desc_1 count
17 womenscloth 114
10 jewelry 90
5 discountstore 87
7 galleries 84
6 furniture 74
13 shoes 68
4 deptstores 56
11 menscloth 50
2 bookstores 41
0 accessories 40
15 thrift_stores 36
1 antiques 31
8 giftshops 24
12 selfstorage 23
9 homedecor 23
3 childcloth 21
14 sportswear 20
16 vintage 19

Healthcare

# Filter and create a copy to avoid SettingWithCopyWarning
health_desc = healthcare.copy()

# Create a new column 'desc_1'
health_desc["desc_1"] = health_desc['alias'].str.split(',').str[0].str.strip().str.lower()
health_desc["desc_2"] = health_desc['alias'].str.split(',').str[1].str.strip().str.lower()
health_desc["desc_3"] = health_desc['alias'].str.split(',').str[2].str.strip().str.lower()

Alias

# Group by 'desc_1' and count, then convert to DataFrame
health_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_1 count
38 laboratorytesting 33
80 urgent_care 26
33 homehealthcare 25
43 medcenters 23
29 generaldentistry 21
35 hospitals 20
37 internalmed 19
21 drugstores 19
26 familydr 16
11 chiropractors 14

Filter for most common labels

health_desc_1 = health_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(20)
top_health_desc = list(health_desc_1["desc_1"])
# filter for obs in the list of top descriptions
health_desc = health_desc[health_desc["desc_1"].isin(top_health_desc)]

Historic landmarks

# Filter and create a copy to avoid SettingWithCopyWarning
historic_desc = historic.copy()

# Create a new column 'desc_1'
historic_desc["desc_1"] = historic_desc['alias'].str.split(',').str[0].str.strip().str.lower()
historic_desc["desc_2"] = historic_desc['alias'].str.split(',').str[1].str.strip().str.lower()
historic_desc["desc_3"] = historic_desc['alias'].str.split(',').str[2].str.strip().str.lower()

Alias

# Group by 'desc_1' and count, then convert to DataFrame
historic_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_1 count
1 landmarks 46
4 museums 25
0 artmuseums 2
2 libraries 2
9 venues 2
3 municipality 1
5 nonprofit 1
6 postoffices 1
7 publicart 1
8 religiousorgs 1

Filter for most common labels

historic_desc_1 = historic_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(3)
top_historic_desc = list(historic_desc_1["desc_1"])
top_historic_desc
['landmarks', 'museums', 'artmuseums']
# filter for obs in the list of top descriptions
historic_desc = historic_desc[historic_desc["desc_1"].isin(top_historic_desc)]
# Define the mapping dictionary
desc_mapping = {
    "artmuseums": "museums"
}

# Replace desc_1 values using the mapping dictionary
historic_desc.loc[:, 'desc_1'] = historic_desc['desc_1'].replace(desc_mapping)
historic_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(20)
desc_1 count
0 landmarks 46
1 museums 27

Kids activities

kids_desc = kids.copy()

# Create a new column 'desc_1'
kids_desc["desc_1"] = kids_desc['alias'].str.split(',').str[0].str.strip().str.lower()
kids_desc["desc_2"] = kids_desc['alias'].str.split(',').str[1].str.strip().str.lower()
kids_desc["desc_3"] = kids_desc['alias'].str.split(',').str[2].str.strip().str.lower()

kids_desc.head()
name rating address alias title geometry desc_1 desc_2 desc_3
0 Makono African Hair Braiding 3.0 5529 Baltimore Ave hairstylists, cosmetics Hair Stylists, Cosmetics & Beauty Supply POINT (2673677.520 233714.490) hairstylists cosmetics NaN
1 Carbon Copy 4.5 701 S 50th St breweries, pizza, bars Breweries, Pizza, Bars POINT (2676997.957 233927.765) breweries pizza bars
2 Firehouse Bicycles 4.5 701 S 50th St bikes, bike_repair_maintenance Bikes, Bike Repair/Maintenance POINT (2677005.367 233951.502) bikes bike_repair_maintenance NaN
3 Common Beat Music 4.0 4916 Baltimore Ave vinyl_records, electronicsrepair, musicinstrumentservices Vinyl Records, Electronics Repair, Musical Instrument Services POINT (2677188.738 233998.088) vinyl_records electronicsrepair musicinstrumentservices
4 Cedar Park 4.5 50th St & Baltimore Ave parks Parks POINT (2677006.643 234133.514) parks NaN NaN

Alias

# Group by 'desc_1' and count, then convert to DataFrame
kids_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_1 count
45 childcare 92
232 parks 53
131 generaldentistry 39
87 drugstores 36
141 gyms 34
255 preschools 29
311 theater 26
95 elementaryschools 26
189 libraries 26
107 festivals 25
kids_desc = kids_desc[kids_desc["desc_1"].isin(["childcare", "parks", "preschools", "theater", "libraries", "festivals", "museums", 
                                                "martialarts", "churches", "playgrounds", "musicalinstrumentsandteachers", "landmarks", 
                                                "kids_activities"])]
kids_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False)
desc_1 count
0 childcare 92
9 parks 53
11 preschools 29
5 libraries 26
12 theater 26
2 festivals 25
7 museums 19
1 churches 18
6 martialarts 18
10 playgrounds 18
8 musicalinstrumentsandteachers 17
4 landmarks 15
3 kids_activities 13

Art and music

arts_desc = arts.copy()

# Create a new column 'desc_1'
arts_desc["desc_1"] = arts_desc['alias'].str.split(',').str[0].str.strip().str.lower()
arts_desc["desc_2"] = arts_desc['alias'].str.split(',').str[1].str.strip().str.lower()
arts_desc["desc_3"] = arts_desc['alias'].str.split(',').str[2].str.strip().str.lower()

arts_desc.head()
name rating address alias title geometry desc_1 desc_2 desc_3
0 Curio Theatre Company 4.5 815 S 48th St theater Performing Arts POINT (2678192.432 234160.959) theater NaN NaN
1 Studio 34 4.5 4522 Baltimore Ave yoga, pilates, theater Yoga, Pilates, Performing Arts POINT (2679553.196 234477.061) yoga pilates theater
2 Painted Bride Art Center 3.5 5212 Market St theater, venues Performing Arts, Venues & Event Spaces POINT (2676029.704 238278.040) theater venues NaN
3 The Rotunda 4.5 4014 Walnut St musicvenues, theater Music Venues, Performing Arts POINT (2682266.656 236480.866) musicvenues theater NaN
4 Penn Live Arts 4.0 3680 Walnut St theater, jazzandblues, musicvenues Performing Arts, Jazz & Blues, Music Venues POINT (2684326.329 236037.832) theater jazzandblues musicvenues

Alias

# Group by 'desc_1' and count, then convert to DataFrame
arts_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_1 count
27 theater 163
7 dancestudio 11
18 musicalinstrumentsandteachers 11
12 galleries 9
20 musicvenues 6
19 musicians 5
11 festivals 5
6 dance_schools 5
3 artschools 4
1 artmuseums 4

Regroup into large categories

# Define the mapping dictionary
desc_mapping = {
    "specialtyschools": "classes",
    "dance_schools": "classes",
    "artschools": "classes",
    "musicalinstrumentsandteachers": "classes",
    "dancestudio": "dancestudios",
    "galleries": "gallery/museum",
    "museums": "gallery/museum",
    "artmuseums": "gallery/museum",
    "festivals": "events/shows",
    "jazzandblues": "events/shows",
    "djs": "events/shows",
    "musicvenues": "events/shows",
    "comedyclubs": "events/shows",
    "magicians": "events/shows",
    "musicians": "events/shows"
}

# Replace desc_1 values using the mapping dictionary
arts_desc.loc[:, 'desc_1'] = arts_desc['desc_1'].replace(desc_mapping)
arts_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(20)
desc_1 count
16 theater 163
7 events/shows 24
2 classes 22
8 gallery/museum 15
4 dancestudios 11
3 culturalcenter 2
0 aerialfitness 1
12 opera 1
17 venues 1
15 summer_camps 1
14 social_clubs 1
13 preschools 1
9 hair 1
11 nonprofit 1
10 language_schools 1
1 arts 1
6 educationservices 1
5 dinnertheater 1
18 yoga 1

Filter for most common labels

arts_desc_1 = arts_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(20)
top_arts_desc = list(arts_desc_1["desc_1"])

# remove education-related categories
top_arts_desc.remove('summer_camps')
top_arts_desc.remove('preschools')
top_arts_desc.remove('language_schools')
top_arts_desc.remove('educationservices')

top_arts_desc
['theater',
 'events/shows',
 'classes',
 'gallery/museum',
 'dancestudios',
 'culturalcenter',
 'aerialfitness',
 'opera',
 'venues',
 'social_clubs',
 'hair',
 'nonprofit',
 'arts',
 'dinnertheater',
 'yoga']
arts_desc = arts_desc[arts_desc["desc_1"].isin(top_arts_desc)]

Beauty

beauty_desc = beauty.copy()

# Create a new column 'desc_1'
beauty_desc["desc_1"] = beauty_desc['alias'].str.split(',').str[0].str.strip().str.lower()
beauty_desc["desc_2"] = beauty_desc['alias'].str.split(',').str[1].str.strip().str.lower()
beauty_desc["desc_3"] = beauty_desc['alias'].str.split(',').str[2].str.strip().str.lower()

beauty_desc.tail()
name rating address alias title geometry desc_1 desc_2 desc_3
2156 Lips and Drips by Erica Marie 5.0 2342 S Broad St medicalspa, ivhydration Medical Spas, IV Hydration POINT (2691839.535 224599.252) medicalspa ivhydration NaN
2157 The Lash Lounge Wyndmoor - 909 Willow 4.0 909 E Willow Grove Ave eyelashservice, eyebrowservices, threadingservices Eyelash Service, Eyebrow Services, Threading Services POINT (2685044.968 283383.097) eyelashservice eyebrowservices threadingservices
2158 Hair Expo 3.5 7721 Crittenden St hairstylists Hair Stylists POINT (2683447.588 279731.355) hairstylists NaN NaN
2159 The Cut Bar 4.0 532 S 4th St barbers Barbers POINT (2697625.976 232400.348) barbers NaN NaN
2160 House of Hair Lounge 3.0 259 S 60th St hairstylists Hair Stylists POINT (2671622.181 237231.748) hairstylists NaN NaN
# Group by 'desc_1' and count, then convert to DataFrame
beauty_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending = False).head(10)
desc_1 count
85 hair 378
146 othersalons 342
12 barbers 149
89 hairstylists 87
43 cosmetics 82
119 massage 74
70 florists 53
117 makeupartists 49
86 hair_extensions 35
188 tattoo 33
remove_cat = ["florists", "jewelry", "drugstores", "generaldentistry", "discountstore", "womenscloth", 
              "deptstores", "furniture", "", "bookstores", "galleries", "giftshops", "artsandcrafts", 
              "gardening", "fabricstores", "antiques", "menscloth", "orthodontists", "vintage", 
              "thriftstores", "stationary", "bikes", "beautysvc", "accessories", "optometrists", 
              "framing", "interiordesign", "tobaccoshops", "homedecor"]

# Grouping, counting, and filtering beauty_desc
beauty_desc_1 = beauty_desc.groupby("desc_1").size().reset_index(name='count').sort_values("count", ascending=False)
top_beauty_desc = beauty_desc_1[beauty_desc_1['count'] > 6]

# Filtering top_beauty_desc to remove unwanted categories
top_beauty_desc = top_beauty_desc[~top_beauty_desc["desc_1"].isin(remove_cat)]["desc_1"]

beauty_desc = beauty_desc[beauty_desc["desc_1"].isin(top_beauty_desc)]

Combine all amenities

# Create explicit copies of the DataFrame slices
food_desc = food_desc.copy()
parks_desc = parks_desc.copy()
edu_desc = edu_desc.copy()
grocery_desc = grocery_desc.copy()
entertain_desc = entertain_desc.copy()
night_desc = night_desc.copy()
shop_desc = shop_desc.copy()
health_desc = health_desc.copy()
historic_desc = historic_desc.copy()
arts_desc = arts_desc.copy()
beauty_desc = beauty_desc.copy()
kids_desc = kids_desc.copy()

# Now perform the operations
food_desc.loc[:, 'type'] = 'restaurant'
parks_desc.loc[:, 'type'] = 'parks'
edu_desc.loc[:, 'type'] = 'education'
grocery_desc.loc[:, 'type'] = 'grocery'
entertain_desc.loc[:, 'type'] = 'entertainment'
night_desc.loc[:, 'type'] = 'nightlife'
shop_desc.loc[:, 'type'] = 'shopping'
health_desc.loc[:, 'type'] = 'healthcare'
historic_desc.loc[:, 'type'] = 'historic'
arts_desc.loc[:, 'type'] = 'arts'
beauty_desc.loc[:, 'type'] = 'beauty'
kids_desc.loc[:, 'type'] = 'kids'
# combine gdfs
amenities = gpd.GeoDataFrame(pd.concat([food_desc, parks_desc, edu_desc, grocery_desc, entertain_desc, night_desc, shop_desc,
                                       health_desc, historic_desc, arts_desc, beauty_desc, kids_desc], ignore_index=True))

Clip to Philly bounds

phl_bound_squeezed = phl_bound_proj.squeeze().geometry

amenities_phl = amenities.within(phl_bound_squeezed)
amenities_phl_gdf = amenities[amenities_phl]
amenities_phl_gdf.explore(tiles='CartoDB dark_matter', legend=True, column='type', cmap='Paired')
# count by type
amenities_grouped = amenities_phl_gdf.groupby('type').size().reset_index(name='count')

Amenities spatial patterns

# Extract unique business types
business_types = amenities_phl_gdf['type'].unique()

# Determine the number of rows and columns for the subplots
n_rows = len(business_types) // 3 + (len(business_types) % 3 > 0)
fig, axes = plt.subplots(n_rows, 3, figsize=(15, n_rows * 4))

# Flatten the axes array for easy looping
axes = axes.flatten()

# Create a map for each business type
for i, business_type in enumerate(business_types):
    # Filter data for the current business type
    subset = amenities_phl_gdf[amenities_phl_gdf['type'] == business_type]

    # Get count for the current business type
    count = amenities_grouped[amenities_grouped['type'] == business_type]['count'].values[0]

    # Plot the points
    subset.plot(ax=axes[i], color='navy', markersize=1, alpha=1)

    # Set title with count (n = count)
    axes[i].set_title(f"{business_type.capitalize()} (n = {count})")

    # Customizations: Remove boxes, axis ticks, and labels
    axes[i].set_axis_off()

# Remove unused subplots
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

# Adjust layout
plt.tight_layout()

# Display the panel of maps
plt.show()

Intersect with neighborhoods

amenities_neighborhood = gpd.sjoin(amenities_phl_gdf, tract, how = "left", predicate = "intersects")
amenities_neighborhood.head()
name rating address alias title geometry desc_1 desc_2 desc_3 type index_right nb_name area_mi2
0 Rodflash Smoked BBQ 0.0 popuprestaurants Pop-Up Restaurants POINT (2676022.529 232162.402) other NaN NaN restaurant 69 Paschall-Kingsessing 08 0.160526
1 Gunpowder Cafe 3.0 NaN fooddeliveryservices, indpak, cafes Food Delivery Services, Indian, Cafes POINT (2676022.529 232162.402) indpak indpak cafes restaurant 69 Paschall-Kingsessing 08 0.160526
2 Lucky Chinese Restaurant 0.0 5541 Chester Ave chinese Chinese POINT (2676286.873 230415.145) other NaN NaN restaurant 68 Paschall-Kingsessing 07 0.097782
3 J & J Restaurant 5.0 5513 Chester Ave caribbean Caribbean POINT (2676518.230 230597.923) other NaN NaN restaurant 68 Paschall-Kingsessing 07 0.097782
4 Thunka Kitchen 4.0 5629 Chester Ave african African POINT (2675970.967 230117.411) other NaN NaN restaurant 68 Paschall-Kingsessing 07 0.097782
amenities_neigh_group = amenities_neighborhood.groupby(["nb_name", "type"]).size().reset_index(name='count')
amenities_neigh_group.head()
nb_name type count
0 Airport-Navy Yard 01 entertainment 2
1 Airport-Navy Yard 01 kids 2
2 Airport-Navy Yard 01 nightlife 1
3 Airport-Navy Yard 01 restaurant 4
4 Airport-Navy Yard 02 beauty 1
amenities_neigh_total = amenities_neigh_group.groupby("nb_name")['count'].sum().reset_index(name='total_amenities')
amenities_neigh_total.head()
nb_name total_amenities
0 Airport-Navy Yard 01 9
1 Airport-Navy Yard 02 69
2 Airport-Navy Yard 03 53
3 Airport-Navy Yard 04 46
4 Airport-Navy Yard 05 57
amenities_neigh_group = amenities_neigh_group.merge(amenities_neigh_total, on="nb_name", how="left")
amenities_neigh_group.head()
nb_name type count total_amenities
0 Airport-Navy Yard 01 entertainment 2 9
1 Airport-Navy Yard 01 kids 2 9
2 Airport-Navy Yard 01 nightlife 1 9
3 Airport-Navy Yard 01 restaurant 4 9
4 Airport-Navy Yard 02 beauty 1 69
amenities_neigh_group["pct_share"] = 100*amenities_neigh_group["count"]/amenities_neigh_group["total_amenities"]
amenities_neigh_group.head()
nb_name type count total_amenities pct_share
0 Airport-Navy Yard 01 entertainment 2 9 22.222222
1 Airport-Navy Yard 01 kids 2 9 22.222222
2 Airport-Navy Yard 01 nightlife 1 9 11.111111
3 Airport-Navy Yard 01 restaurant 4 9 44.444444
4 Airport-Navy Yard 02 beauty 1 69 1.449275
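The total-then-merge-then-divide sequence above can also be collapsed into a single step with `groupby().transform("sum")`, which broadcasts each tract's total back onto its rows. A minimal sketch with toy data standing in for `amenities_neigh_group`:

```python
import pandas as pd

# Toy stand-in for amenities_neigh_group: amenity counts per tract
g = pd.DataFrame({
    "nb_name": ["A", "A", "B"],
    "type": ["parks", "restaurant", "parks"],
    "count": [1, 3, 2],
})

# transform("sum") returns a Series aligned to the original rows,
# replacing the separate total + merge steps
g["pct_share"] = 100 * g["count"] / g.groupby("nb_name")["count"].transform("sum")
```

The result matches the merge-based version, without the intermediate `total_amenities` DataFrame.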
amenities_neigh_group_gdf = amenities_neigh_group.merge(tract, on = "nb_name", how = "left")

amenities_neigh = gpd.GeoDataFrame(amenities_neigh_group_gdf, geometry='geometry')

amenities_nb = amenities_neigh[["nb_name", "type", "count", "total_amenities", "pct_share", "geometry", "area_mi2"]]
amenities_nb = amenities_nb.copy()
amenities_nb.loc[:, 'count_per_mi2'] = amenities_nb['count'] / amenities_nb['area_mi2']
amenities_nb[amenities_nb["type"] == "kids"].explore(tiles='CartoDB dark_matter', legend=True, column = "pct_share", cmap = "magma")
# Extract unique types
amenity_types = amenities_nb['type'].unique()

# Determine the number of rows and columns for the subplots
n_rows = len(amenity_types) // 3 + (len(amenity_types) % 3 > 0)
fig, axes = plt.subplots(n_rows, 3, figsize=(15, n_rows * 4))

# Flatten the axes array for easy looping
axes = axes.flatten()

# Create a choropleth map for each amenity type
for i, amenity_type in enumerate(amenity_types):
    # Filter data for the current amenity type
    subset = amenities_nb[amenities_nb['type'] == amenity_type]

    # Plotting
    subset.plot(column='pct_share', ax=axes[i], legend=True,
                legend_kwds={'label': "share of all amenities"},
                cmap='YlGnBu')

    # Set title
    axes[i].set_title(amenity_type.capitalize())
    
     # Remove boxes, axis ticks, and axis labels
    axes[i].set_axis_off()

# Remove unused subplots
for j in range(i+1, len(axes)):
    fig.delaxes(axes[j])

# Adjust layout
plt.tight_layout()

# Display the panel of maps
plt.show()

Correlation analysis

# spreading the data
amenities_nb_wide = amenities_nb.pivot_table(index='nb_name', columns='type', values='pct_share', aggfunc='mean').fillna(0)
# Calculating the correlation matrix
correlation_matrix = amenities_nb_wide.corr()

import altair as alt

# Reset index to convert the index into a column for Altair
heatmap_data = correlation_matrix.reset_index().melt('type', var_name='type2', value_name='correlation')

# Create the heatmap
heatmap = alt.Chart(heatmap_data).mark_rect().encode(
    x='type:N',
    y='type2:N',
    color='correlation:Q'
)

# Add text to each cell
text = heatmap.mark_text(baseline='middle').encode(
    text=alt.Text('correlation:Q', format='.2f'),
    color=alt.condition(
        alt.datum.correlation > 0.5, 
        alt.value('white'),
        alt.value('black')
    )
)

# Display the chart
chart = (heatmap + text).properties(width=600, height=600)
chart

Cluster analysis

from sklearn.preprocessing import MinMaxScaler, RobustScaler
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

Overall (all features)

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
amenities_nb_wide = amenities_nb.pivot_table(index='nb_name', columns='type', values='pct_share', fill_value=0).reset_index()
amenities_nb_scaled = scaler.fit_transform(amenities_nb_wide[["arts", "education", "kids", "nightlife", "parks", "restaurant", "shopping"]])
amenities_nb_scaled[:5]
array([[-0.38932863, -0.60659512,  1.59378172,  0.24376265, -0.3493857 ,
         0.36173374, -0.69059262],
       [-0.38932863, -0.60659512, -0.46384369,  0.28301402, -0.3493857 ,
         1.55557841,  0.01792224],
       [ 0.00852099, -0.60659512, -0.28913964,  1.18061134, -0.3493857 ,
         0.53185073, -0.69059262],
       [-0.38932863, -0.44980285,  0.14002464, -0.30575651, -0.01282637,
        -0.39555578,  2.14346683],
       [-0.01939828, -0.22699277, -0.30139958, -0.08884105, -0.3493857 ,
         0.85784044,  0.59592121]])
amenities_nb_scaled.mean(axis=0)
array([ 3.59101787e-17, -4.30922144e-17, -1.43640715e-17,  8.61844289e-17,
        2.39401191e-17, -7.06233514e-17,  8.25934110e-17])
amenities_nb_scaled.std(axis=0)
array([1., 1., 1., 1., 1., 1., 1.])
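The near-zero means (on the order of 1e-17) and unit standard deviations above confirm the z-scoring. `StandardScaler` is equivalent to subtracting each column's mean and dividing by its (population) standard deviation; a minimal check with toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.array([[1.0, 10.0], [3.0, 20.0], [5.0, 60.0]])

# Manual z-score: column means out, population stds (ddof=0) in the denominator,
# matching StandardScaler's defaults
Z_manual = (X - X.mean(axis=0)) / X.std(axis=0)
Z_scaler = StandardScaler().fit_transform(X)
```

Both arrays agree elementwise, and each scaled column has mean 0 and standard deviation 1.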
# Perform the fit
kmeans.fit(amenities_nb_scaled)

# Extract the labels
amenities_nb_wide['label'] = kmeans.labels_
(
    alt.Chart(amenities_nb_wide)
    .mark_circle()
    .encode(
        alt.X("restaurant:Q", scale=alt.Scale(zero=False)),
        alt.Y("education:Q", scale=alt.Scale(zero=False)),
        color=alt.Color("label:N", scale=alt.Scale(scheme="dark2")),
        tooltip=list(amenities_nb_wide.columns),
    )
    .properties(width=800, height=600)
    .interactive()
)
amenities_nb_wide.groupby("label").size()
label
0     49
1     29
2     48
3    220
4     25
dtype: int64
# Average restaurants per cluster
amenities_nb_wide.groupby("label")['restaurant'].mean().sort_values()
label
2     8.320734
0     8.833224
1    12.488124
4    18.099593
3    50.925199
Name: restaurant, dtype: float64
# Average parks share per cluster
amenities_nb_wide.groupby("label")['parks'].mean().sort_values()
label
2     0.614781
3     0.775837
1     0.852458
0     2.410477
4    19.769207
Name: parks, dtype: float64
amenities_clusters = amenities_nb_wide.merge(tract, how = "left", on = "nb_name")
amenities_clusters_gdf = gpd.GeoDataFrame(amenities_clusters, geometry = "geometry")
amenities_clusters_gdf.explore(
    column="label",
    cmap="Set3",
    categorical=True,
    legend=True,
    tiles="CartoDB positron"
)
amenities_clusters_gdf.to_file('clusters.geojson', driver='GeoJSON', index=False)
cluster_profile = amenities_nb_wide.groupby("label")[["arts", "education", "kids", "nightlife", "parks", "restaurant", "shopping"]].mean()

cluster_profile
type arts education kids nightlife parks restaurant shopping
label
0 10.122255 10.110204 4.389460 21.820754 2.410477 8.833224 7.430543
1 0.567573 42.981270 8.490453 2.605510 0.852458 12.488124 2.235311
2 0.905578 5.607784 2.897750 10.489589 0.614781 8.320734 31.978278
3 0.529042 4.282592 2.856462 5.174781 0.775837 50.925199 5.233768
4 0.508021 6.682243 25.187898 8.897605 19.769207 18.099593 1.166982
# Number of clusters to try out
n_clusters = list(range(2, 15))

# Run kmeans for each value of k
inertias = []
for k in n_clusters:
    
    # Initialize and run
    kmeans = KMeans(n_clusters=k, n_init=20)
    kmeans.fit(amenities_nb_scaled)
    
    # Save the "inertia"
    inertias.append(kmeans.inertia_)
    
# Plot it!
plt.plot(n_clusters, inertias, marker='o', ms=10, mfc='white', lw=4, mew=3);

from kneed import KneeLocator

# Initialize the knee algorithm
kn = KneeLocator(n_clusters, inertias, curve='convex', direction='decreasing')

# Print out the knee 
print(kn.knee)
8

Restaurants

kmeans = KMeans(n_clusters=9, random_state=42, n_init=10)
restaurant = food_desc[["name", "desc_1"]]
nb_gdf = amenities_neighborhood[["nb_name", "name", "rating"]]

restaurants_nb_gdf = restaurant.merge(nb_gdf, on = "name", how = "left")
restaurants_nb_grouped = restaurants_nb_gdf.groupby(["nb_name", "desc_1"]).size().reset_index(name="count")

# Sort within each neighborhood group by count in descending order
restaurants_nb_grouped = restaurants_nb_grouped.sort_values(by=["nb_name", "count"], ascending=[True, False])
restaurants_nb_grouped.head()
nb_name desc_1 count
0 Airport-Navy Yard 01 italian 2
1 Airport-Navy Yard 01 pizza 1
2 Airport-Navy Yard 01 sandwiches/delis 1
8 Airport-Navy Yard 02 fastfood 46
4 Airport-Navy Yard 02 bakeries/cafes 17
restaurants_nb_wide = restaurants_nb_grouped.pivot(index='nb_name', columns='desc_1', values='count').reset_index().fillna(0)

# Calculating the correlation matrix (numeric columns only)
correlation_matrix = restaurants_nb_wide.corr(numeric_only=True)

import altair as alt

# Reset index to convert the index into a column for Altair
heatmap_data = correlation_matrix.reset_index().melt('desc_1', var_name='desc_2', value_name='correlation')
# Calculating the total restaurants for each neighborhood
restaurants_total = restaurants_nb_wide.set_index('nb_name').sum(axis=1)

# Using 'map' to align and assign the total restaurants back to the original DataFrame
restaurants_nb_wide["total_restaurants"] = restaurants_nb_wide['nb_name'].map(restaurants_total)

restaurants_nb_wide.head()
desc_1 nb_name american asianfusion bakeries/cafes bbq breakfast_brunch caribbean chicken chinese cocktailbars diner/grill fastfood halal indpak italian japanese korean latin mediterranean pizza salad sandwiches sandwiches/delis seafood soulfood venues vietnamese total_restaurants
0 Airport-Navy Yard 01 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 4.0
1 Airport-Navy Yard 02 2.0 0.0 17.0 0.0 6.0 0.0 2.0 0.0 0.0 3.0 46.0 0.0 0.0 4.0 2.0 0.0 4.0 3.0 4.0 11.0 2.0 6.0 7.0 0.0 0.0 0.0 119.0
2 Airport-Navy Yard 03 4.0 2.0 0.0 0.0 0.0 0.0 2.0 0.0 2.0 4.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 6.0 0.0 5.0 3.0 3.0 0.0 0.0 0.0 31.0
3 Airport-Navy Yard 04 2.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.0 1.0 0.0 0.0 8.0
4 Airport-Navy Yard 05 5.0 0.0 6.0 0.0 2.0 0.0 0.0 2.0 0.0 1.0 67.0 0.0 0.0 3.0 2.0 0.0 3.0 0.0 0.0 6.0 2.0 0.0 0.0 0.0 2.0 0.0 101.0
# Columns representing different cuisines
cuisine_columns = ['bakeries/cafes', 
                   'chicken', 'chinese', 'diner/grill', 
                   'fastfood', "korean",
                   'pizza', 
                   'sandwiches/delis', 'vietnamese']

# Convert cuisine counts to percentages of the total restaurants
for column in cuisine_columns:
    restaurants_nb_wide[column] = ((restaurants_nb_wide[column] / restaurants_nb_wide['total_restaurants']) * 100).round(2)

# Check the updated DataFrame
restaurants_nb_wide.head()
desc_1 nb_name american asianfusion bakeries/cafes bbq breakfast_brunch caribbean chicken chinese cocktailbars diner/grill fastfood halal indpak italian japanese korean latin mediterranean pizza salad sandwiches sandwiches/delis seafood soulfood venues vietnamese total_restaurants
0 Airport-Navy Yard 01 0.0 0.0 0.00 0.0 0.0 0.0 0.00 0.00 0.0 0.00 0.00 0.0 0.0 2.0 0.0 0.0 0.0 0.0 25.00 0.0 0.0 25.00 0.0 0.0 0.0 0.0 4.0
1 Airport-Navy Yard 02 2.0 0.0 14.29 0.0 6.0 0.0 1.68 0.00 0.0 2.52 38.66 0.0 0.0 4.0 2.0 0.0 4.0 3.0 3.36 11.0 2.0 5.04 7.0 0.0 0.0 0.0 119.0
2 Airport-Navy Yard 03 4.0 2.0 0.00 0.0 0.0 0.0 6.45 0.00 2.0 12.90 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 19.35 0.0 5.0 9.68 3.0 0.0 0.0 0.0 31.0
3 Airport-Navy Yard 04 2.0 0.0 12.50 0.0 0.0 0.0 0.00 0.00 0.0 12.50 0.00 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 0.0 0.0 0.00 3.0 1.0 0.0 0.0 8.0
4 Airport-Navy Yard 05 5.0 0.0 5.94 0.0 2.0 0.0 0.00 1.98 0.0 0.99 66.34 0.0 0.0 3.0 2.0 0.0 3.0 0.0 0.00 6.0 2.0 0.00 0.0 0.0 2.0 0.0 101.0
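The per-column loop above can also be written as one vectorized division. A minimal sketch with toy counts (column names and values are illustrative, not from the Yelp data):

```python
import pandas as pd

# Toy counts per neighborhood; "total" plays the role of total_restaurants
df = pd.DataFrame({
    "nb_name": ["A", "B"],
    "pizza": [2.0, 5.0],
    "fastfood": [6.0, 15.0],
    "total": [8.0, 20.0],
})

cuisine_cols = ["pizza", "fastfood"]

# Divide every cuisine column by the row total at once, then convert to %
df[cuisine_cols] = df[cuisine_cols].div(df["total"], axis=0).mul(100).round(2)

print(df)
```

`DataFrame.div(..., axis=0)` broadcasts the row totals across all selected columns in a single step, avoiding the explicit loop.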
# Create the heatmap
heatmap = alt.Chart(heatmap_data).mark_rect().encode(
    x='desc_1:N',
    y='desc_2:N',
    color='correlation:Q'
)

# Add text to each cell
text = heatmap.mark_text(baseline='middle').encode(
    text=alt.Text('correlation:Q', format='.2f'),
    color=alt.condition(
        alt.datum.correlation > 0.5, 
        alt.value('white'),
        alt.value('black')
    )
)

# Display the chart
chart = (heatmap + text).properties(width=1000, height=1000)
chart
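`heatmap_data` is not defined in this excerpt; assuming it holds pairwise correlations between cuisine shares, one way to build such a long-format table with `desc_1`, `desc_2`, and `correlation` columns (toy data) is:

```python
import pandas as pd

# Toy wide table of cuisine shares per neighborhood
wide = pd.DataFrame({
    "pizza": [1, 2, 3, 4],
    "fastfood": [2, 4, 6, 8],
    "bakeries/cafes": [4, 3, 2, 1],
})

# Pairwise correlations, melted to long format for Altair's mark_rect
corr = wide.corr()
heatmap_data = (
    corr.reset_index()
        .melt(id_vars="index", var_name="desc_2", value_name="correlation")
        .rename(columns={"index": "desc_1"})
)
print(heatmap_data.head())
```

Altair expects one row per cell, so the square correlation matrix must be melted before it can drive the `x`/`y`/`color` encodings above.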
restaurants_nb_scaled = scaler.fit_transform(restaurants_nb_wide[['bakeries/cafes', 
                   'chicken', 'chinese', 'diner/grill', 
                   'fastfood', "korean",
                   'pizza', 
                   'sandwiches/delis', 'vietnamese']])
restaurants_nb_scaled[:5]
restaurants_nb_scaled.std(axis=0)
array([1., 1., 1., 1., 1., 1., 1., 1., 1.])
restaurants_nb_scaled.mean(axis=0)
array([-1.20970537e-16, -1.43254584e-17,  1.33704278e-16, -3.34260696e-17,
       -5.73018335e-17,  1.27337408e-17,  0.00000000e+00, -6.84438567e-17,
       -6.36687039e-18])
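`scaler` is assumed to be a scikit-learn `StandardScaler` fitted earlier in the notebook; the unit standard deviations and near-zero means above are exactly what z-scoring produces. The same transform by hand:

```python
import numpy as np

X = np.array([[10.0, 0.0],
              [20.0, 5.0],
              [30.0, 10.0]])

# z-score each column: subtract the column mean, divide by the column std
# (population std, ddof=0, matching StandardScaler's convention)
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_scaled.mean(axis=0))  # ~0 for every column
print(X_scaled.std(axis=0))   # 1 for every column
```

Standardizing first matters for k-means: without it, high-count categories like `fastfood` would dominate the Euclidean distances.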
# Perform the fit
kmeans.fit(restaurants_nb_scaled)

# Extract the labels
restaurants_nb_wide['label'] = kmeans.labels_
(
    alt.Chart(restaurants_nb_wide)
    .mark_circle()
    .encode(
        alt.X("bakeries/cafes:Q", scale=alt.Scale(zero=False)),
        alt.Y("chinese:Q", scale=alt.Scale(zero=False)),
        color=alt.Color("label:N", scale=alt.Scale(scheme="dark2")),
        tooltip=list(restaurants_nb_wide.columns),
    )
    .properties(width=800, height=600)
    .interactive()
)
restaurants_nb_wide.groupby("label").size()
label
0    81
1    26
2    18
3    97
4    26
5     1
6     1
7    22
8     7
dtype: int64
# Average restaurants per cluster
restaurants_nb_wide.groupby("label")['bakeries/cafes'].mean().sort_values()
label
5     0.000000
6     0.000000
1     0.349615
8     0.475714
0     3.369012
4     7.684231
7     7.992273
3     8.589588
2    72.334444
Name: bakeries/cafes, dtype: float64
restaurants_clusters = restaurants_nb_wide.merge(tract, how = "left", on = "nb_name")
restaurants_clusters_gdf = gpd.GeoDataFrame(restaurants_clusters, geometry = "geometry")
restaurants_clusters_gdf.explore(
    column="label",
    cmap="Set3",
    categorical=True,
    legend=True,
    tiles="CartoDB positron"
)
food_profile = restaurants_nb_wide.groupby("label")[['bakeries/cafes', 
                   'chicken', 'chinese', 'diner/grill', 
                   'fastfood', "korean",
                   'pizza', 
                   'sandwiches/delis', 'vietnamese']].mean().round()

food_profile
desc_1 bakeries/cafes chicken chinese diner/grill fastfood korean pizza sandwiches/delis vietnamese
label
0 3.0 2.0 3.0 1.0 71.0 0.0 4.0 3.0 0.0
1 0.0 0.0 4.0 1.0 0.0 1.0 76.0 5.0 0.0
2 72.0 1.0 3.0 0.0 0.0 0.0 12.0 4.0 0.0
3 9.0 1.0 4.0 1.0 1.0 0.0 11.0 25.0 0.0
4 8.0 1.0 50.0 0.0 1.0 0.0 14.0 7.0 0.0
5 0.0 0.0 33.0 0.0 0.0 0.0 0.0 0.0 67.0
6 0.0 0.0 33.0 0.0 0.0 33.0 33.0 0.0 0.0
7 8.0 6.0 5.0 22.0 0.0 0.0 9.0 8.0 0.0
8 0.0 50.0 11.0 1.0 7.0 0.0 10.0 7.0 0.0
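A quick way to name each cluster is to take the dominant cuisine with `idxmax` across the profile's columns; sketched here on the rounded shares from the table above:

```python
import pandas as pd

# Cluster profile transcribed from the table above (rows = labels 0-8)
food_profile = pd.DataFrame({
    "bakeries/cafes":   [3, 0, 72, 9, 8, 0, 0, 8, 0],
    "chicken":          [2, 0, 1, 1, 1, 0, 0, 6, 50],
    "chinese":          [3, 4, 3, 4, 50, 33, 33, 5, 11],
    "diner/grill":      [1, 1, 0, 1, 0, 0, 0, 22, 1],
    "fastfood":         [71, 0, 0, 1, 1, 0, 0, 0, 7],
    "korean":           [0, 1, 0, 0, 0, 0, 33, 0, 0],
    "pizza":            [4, 76, 12, 11, 14, 0, 33, 9, 10],
    "sandwiches/delis": [3, 5, 4, 25, 7, 0, 0, 8, 7],
    "vietnamese":       [0, 0, 0, 0, 0, 67, 0, 0, 0],
})
food_profile.index.name = "label"

# Dominant cuisine per cluster (ties resolve to the first column)
dominant = food_profile.idxmax(axis=1)
print(dominant)
```

This makes the map legend readable at a glance: cluster 0 is the fast-food profile, cluster 2 the bakeries/cafes profile, and so on.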
restaurants_clusters_gdf.to_file('food_clusters.geojson', driver='GeoJSON', index=False)
# Number of clusters to try out
n_clusters = list(range(2, 15))

# Run kmeans for each value of k
inertias = []
for k in n_clusters:
    
    # Initialize and run
    kmeans = KMeans(n_clusters=k, n_init=10)
    kmeans.fit(restaurants_nb_scaled)
    
    # Save the "inertia"
    inertias.append(kmeans.inertia_)
    
# Plot it!
plt.plot(n_clusters, inertias, marker='o', ms=10, mfc='white', lw=4, mew=3);

from kneed import KneeLocator

# Initialize the knee algorithm
kn = KneeLocator(n_clusters, inertias, curve='convex', direction='decreasing')

# Print out the knee 
print(kn.knee)
9
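`KneeLocator` implements the Kneedle algorithm: normalize the curve to the unit square and pick the point farthest below the straight line joining its endpoints. A dependency-free sketch of that core idea (KneeLocator adds smoothing and a sensitivity parameter, so this is illustrative, not a drop-in replacement):

```python
import numpy as np

def find_knee(ks, inertias):
    """Locate the elbow of a convex, decreasing inertia curve."""
    x = np.asarray(ks, dtype=float)
    y = np.asarray(inertias, dtype=float)
    # Normalize both axes to [0, 1]
    xn = (x - x.min()) / (x.max() - x.min())
    yn = (y - y.min()) / (y.max() - y.min())
    # For a convex decreasing curve, the knee maximizes the gap
    # below the diagonal running from (0, 1) to (1, 0)
    gap = (1 - xn) - yn
    return int(x[np.argmax(gap)])

# Synthetic convex, decreasing curve: inertia ~ 1/k
ks = list(range(2, 15))
print(find_knee(ks, [1.0 / k for k in ks]))  # -> 5
```

On the real restaurant inertias the notebook's KneeLocator run returned k = 9, consistent with the nine cluster labels reported above.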

Nightlife

nightlife = night_desc[["name", "desc_1"]]
nb_gdf = amenities_neighborhood[["neighborhood_name", "name", "rating"]]

nightlife_nb_gdf = nightlife.merge(nb_gdf, on = "name", how = "left")
nightlife_nb_grouped = nightlife_nb_gdf.groupby(["neighborhood_name", "desc_1"]).size().reset_index(name="count")

# Sort within each neighborhood group by count in descending order
nightlife_nb_grouped = nightlife_nb_grouped.sort_values(by=["neighborhood_name", "count"], ascending=[True, False])
nightlife_nb_grouped.head(20)
neighborhood_name desc_1 count
5 Airport-Navy Yard sportsbars 21
6 Airport-Navy Yard wine_bars 14
0 Airport-Navy Yard bars 13
2 Airport-Navy Yard cocktailbars 5
1 Airport-Navy Yard beerbar 2
4 Airport-Navy Yard pubs 2
3 Airport-Navy Yard lounges 1
9 Bustleton sportsbars 14
7 Bustleton bars 2
8 Bustleton cocktailbars 2
10 Center City East bars 42
12 Center City East cocktailbars 28
13 Center City East lounges 13
14 Center City East musicvenues 12
15 Center City East pubs 10
16 Center City East sportsbars 10
17 Center City East wine_bars 9
11 Center City East beerbar 5
18 Center City West bars 28
20 Center City West cocktailbars 19
nightlife_nb_wide = nightlife_nb_grouped.pivot(index='neighborhood_name', columns='desc_1', values='count').fillna(0).reset_index()
nightlife_nb_scaled = scaler.fit_transform(nightlife_nb_wide[['bars', 'beerbar', 'cocktailbars', 'divebars', 'lounges', 'musicvenues', 'pubs',
                                                             'sportsbars', 'wine_bars']])
nightlife_nb_scaled[:5]
nightlife_nb_scaled.std(axis=0)
array([1., 1., 1., 1., 1., 1., 1., 1., 1.])
nightlife_nb_scaled.mean(axis=0)
array([ 3.70074342e-17,  9.25185854e-18,  0.00000000e+00, -3.70074342e-17,
       -1.85037171e-17, -1.85037171e-17,  1.85037171e-17, -5.55111512e-17,
        3.70074342e-17])
# Perform the fit
kmeans.fit(nightlife_nb_scaled)

# Extract the labels
nightlife_nb_wide['label'] = kmeans.labels_
(
    alt.Chart(nightlife_nb_wide)
    .mark_circle()
    .encode(
        alt.X("bars:Q", scale=alt.Scale(zero=False)),
        alt.Y("musicvenues:Q", scale=alt.Scale(zero=False)),
        color=alt.Color("label:N", scale=alt.Scale(scheme="dark2")),
        tooltip=list(nightlife_nb_wide.columns),
    )
    .properties(width=800, height=600)
    .interactive()
)
nightlife_nb_wide.groupby("label").size()
label
0    35
1     5
2     1
3     5
4     2
dtype: int64
# `tract` was reduced to nb_name/area_mi2/geometry earlier, so it no longer
# carries a neighborhood_name column; merge against the full tract layer instead
nightlife_clusters = nightlife_nb_wide.merge(phl_tract_proj[["neighborhood_name", "geometry"]], how="left", on="neighborhood_name")
nightlife_clusters_gdf = gpd.GeoDataFrame(nightlife_clusters, geometry = "geometry")
nightlife_clusters_gdf.explore(
    column="label",
    cmap="Set3",
    categorical=True,
    legend=True,
    tiles="CartoDB positron"
)

District Profiles

clusters = gpd.read_file("data/clusters.geojson")
districts = gpd.read_file("data/Planning_Districts.geojson")[["DIST_NAME", "geometry"]]
districts_proj = districts.to_crs(2272)

cluster_ctr = clusters.to_crs(2272).copy()  # reproject to match districts_proj before taking centroids
cluster_ctr.geometry = cluster_ctr.geometry.centroid
tract_dist = cluster_ctr.sjoin(districts_proj)
district_profile = tract_dist.groupby("DIST_NAME")[['arts', 'beauty', 'education', 'entertainment', 'grocery', 
                                   'healthcare', 'historic', 'kids', 'nightlife', 'parks', 
                                   'restaurant', 'shopping']].mean()

district_profile.to_csv('district_profile.csv')  # keep DIST_NAME (the index) in the output

district_profile
arts beauty education entertainment grocery healthcare historic kids nightlife parks restaurant shopping
DIST_NAME
Central 5.971490 10.281469 9.449355 10.740264 8.900751 0.360445 1.637355 5.939139 14.530610 4.821984 16.498055 10.869083
Central Northeast 0.540962 25.421063 4.728159 0.540962 4.113855 0.000000 0.000000 10.192575 4.474206 3.506944 33.184762 13.296511
Lower Far Northeast 0.461133 24.714222 14.267947 2.071256 3.328687 0.000000 0.000000 10.617565 7.457475 0.000000 29.417666 7.664049
Lower North 1.301211 13.859954 6.465486 4.081751 8.906960 0.000000 0.763305 2.175683 5.643713 1.702007 47.696623 7.403306
Lower Northeast 1.315789 21.832727 13.772161 1.621478 11.047101 0.000000 0.000000 4.521085 3.503324 0.000000 35.050175 7.336160
Lower Northwest 0.359902 13.301367 13.765191 2.918120 6.960742 0.089127 0.000000 11.537796 10.622454 7.661078 27.755092 5.029132
Lower South 0.910295 3.846154 1.956815 13.644446 1.315789 0.000000 0.000000 10.312004 12.756644 1.923077 48.105357 5.229420
Lower Southwest 0.000000 13.670149 1.217391 0.500000 4.281950 0.000000 0.000000 0.909091 4.568511 5.000000 65.396386 4.456522
North 0.265252 22.470343 6.713256 1.144556 4.142242 0.000000 0.000000 1.706497 8.540435 0.662162 39.768439 14.586818
North Delaware 1.353491 19.597775 3.998782 2.436825 5.255671 0.339080 0.000000 9.639178 9.953720 3.217949 38.253892 5.953636
River Wards 1.992442 17.306946 6.730324 5.083619 7.024615 0.000000 0.000000 3.097565 11.344671 0.860001 35.939342 10.620475
South 3.374176 15.880016 5.555589 6.464443 16.421297 0.128205 0.072464 4.316686 13.612930 3.406275 23.779671 6.988249
University Southwest 2.290731 12.430496 12.013138 7.345937 4.753251 0.441176 2.500000 5.935079 8.711911 2.033190 34.115762 7.429327
Upper Far Northeast 2.757153 24.947634 11.107189 6.078872 4.107238 0.000000 0.000000 3.549484 8.009225 2.721088 29.963477 6.758640
Upper North 0.603448 22.641249 9.716666 1.808242 6.328772 0.000000 0.000000 0.835392 2.162648 0.000000 49.953789 5.949795
Upper Northwest 2.829100 21.347420 8.959644 4.127680 7.336850 0.631103 1.577909 4.020826 2.262047 0.603468 35.234163 11.069790
West 0.063131 36.707531 8.104885 0.914465 7.138201 0.000000 0.000000 4.999307 7.647183 1.850168 29.028258 3.546871
West Park 1.573720 13.094920 6.222580 9.617938 5.851341 0.000000 0.000000 4.660455 9.496647 1.428571 34.637795 13.416033
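The district profiles hinge on assigning each cluster polygon to a district by its centroid (the `sjoin` on centroids above). A dependency-free sketch of that assignment, with districts as axis-aligned boxes and polygons as vertex lists (all names and coordinates invented; geopandas computes the true area centroid, while the vertex mean here is a rough stand-in):

```python
def centroid(vertices):
    """Mean of the polygon's vertices -- a crude stand-in for the area centroid."""
    xs, ys = zip(*vertices)
    return sum(xs) / len(xs), sum(ys) / len(ys)

def assign_district(vertices, districts):
    """Return the name of the first district box containing the polygon's centroid."""
    cx, cy = centroid(vertices)
    for name, (xmin, ymin, xmax, ymax) in districts.items():
        if xmin <= cx <= xmax and ymin <= cy <= ymax:
            return name
    return None  # centroid falls outside every district

districts = {"Central": (0, 0, 10, 10), "North": (0, 10, 10, 20)}
tract_poly = [(2, 12), (4, 12), (4, 14), (2, 14)]
print(assign_district(tract_poly, districts))  # prints "North"
```

Using centroids avoids double-counting a tract whose boundary straddles two districts: each polygon lands in exactly one district, which is what the `groupby("DIST_NAME").mean()` step requires.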
Averaging the same amenity shares by cluster label rather than by district gives each cluster's overall profile:
type arts beauty education entertainment grocery healthcare historic kids nightlife parks restaurant shopping
label
0 1.029109 35.168462 5.336747 3.139291 6.920761 0.318833 0.152848 3.986580 8.459712 0.830587 19.255118 15.401953
1 9.011412 6.799441 11.747031 17.897042 6.421039 0.079158 0.880370 3.914400 18.284042 2.420829 11.698266 10.846969
2 0.577297 5.976170 7.180234 5.102304 3.245692 0.505051 3.044482 25.018387 9.605864 21.546735 16.871668 1.326115
3 0.371984 16.328841 5.427419 1.220435 4.626384 0.009956 0.292610 2.990855 4.661643 0.793312 59.421667 3.854895
4 0.571429 3.822222 34.893651 0.000000 31.874603 0.000000 0.000000 7.266667 5.666667 0.571429 5.666667 9.666667